Warning: mkdir(): No space left on device in /var/www/tg-me/post.php on line 37

Warning: file_put_contents(aCache/aDaily/post/LinghaoCh/--): Failed to open stream: No such file or directory in /var/www/tg-me/post.php on line 50
Parallel Experiments | Telegram Webview: LinghaoCh/938 -
Telegram Group & Telegram Channel
https://arxiv.org/abs/2305.18290 #llm #ai

今天深入学习了 DPO,再次感叹扎实的数学功底对 AI/ML Research 的重要性……

原始的 RLHF 是用 pairwise human preference data(A 和 B 哪个更好)去训练一个 reward model,然后用 RL 来训练主 policy model,objective 是 minimize negative log likelihood + regularization(比如 PPO 就是通过新旧 policy 之间的 KL Divergence 来做 regularization)。这样的缺点在于 RL 是出了名的难搞,而且还需要一个 critic model 来预测 reward,使得整个系统的复杂性很高。

DPO 的思路是,观察到 RLHF 的 objective 本质上是 minimize loss over (latent) reward function,通过一番 reparameterization 等数学推导,重新设计了一个 minimize loss over policy 的 objective,绕过了中间这个 reward model,让 gradient update 直接增加 policy model 生成 winner response 的概率并降低 loser response 的概率,大幅简化了流程。

拓展阅读:
- KTO: 更进一步,不需要 pairwise comparison,只用对 individual example 的 upvote/downvote 也可以学习到 preference。
- IPO: 解决 DPO 容易 overfit 的问题。



tg-me.com/LinghaoCh/938
Create:
Last Update:

https://arxiv.org/abs/2305.18290 #llm #ai

今天深入学习了 DPO,再次感叹扎实的数学功底对 AI/ML Research 的重要性……

原始的 RLHF 是用 pairwise human preference data(A 和 B 哪个更好)去训练一个 reward model,然后用 RL 来训练主 policy model,objective 是 minimize negative log likelihood + regularization(比如 PPO 就是通过新旧 policy 之间的 KL Divergence 来做 regularization)。这样的缺点在于 RL 是出了名的难搞,而且还需要一个 critic model 来预测 reward,使得整个系统的复杂性很高。

DPO 的思路是,观察到 RLHF 的 objective 本质上是 minimize loss over (latent) reward function,通过一番 reparameterization 等数学推导,重新设计了一个 minimize loss over policy 的 objective,绕过了中间这个 reward model,让 gradient update 直接增加 policy model 生成 winner response 的概率并降低 loser response 的概率,大幅简化了流程。

拓展阅读:
- KTO: 更进一步,不需要 pairwise comparison,只用对 individual example 的 upvote/downvote 也可以学习到 preference。
- IPO: 解决 DPO 容易 overfit 的问题。

BY Parallel Experiments




Share with your friend now:
tg-me.com/LinghaoCh/938

View MORE
Open in Telegram


Parallel Experiments Telegram | DID YOU KNOW?

Date: |

Pinterest (PINS) Stock Sinks As Market Gains

Pinterest (PINS) closed at $71.75 in the latest trading session, marking a -0.18% move from the prior day. This change lagged the S&P 500's daily gain of 0.1%. Meanwhile, the Dow gained 0.9%, and the Nasdaq, a tech-heavy index, lost 0.59%. Heading into today, shares of the digital pinboard and shopping tool company had lost 17.41% over the past month, lagging the Computer and Technology sector's loss of 5.38% and the S&P 500's gain of 0.71% in that time. Investors will be hoping for strength from PINS as it approaches its next earnings release. The company is expected to report EPS of $0.07, up 170% from the prior-year quarter. Our most recent consensus estimate is calling for quarterly revenue of $467.87 million, up 72.05% from the year-ago period.

Spiking bond yields driving sharp losses in tech stocks

A spike in interest rates since the start of the year has accelerated a rotation out of high-growth technology stocks and into value stocks poised to benefit from a reopening of the economy. The Nasdaq has fallen more than 10% over the past month as the Dow has soared to record highs, with a spike in the 10-year US Treasury yield acting as the main catalyst. It recently surged to a cycle high of more than 1.60% after starting the year below 1%. But according to Jim Paulsen, the Leuthold Group's chief investment strategist, rising interest rates do not represent a long-term threat to the stock market. Paulsen expects the 10-year yield to cross 2% by the end of the year. A spike in interest rates and its impact on the stock market depends on the economic backdrop, according to Paulsen. Rising interest rates amid a strengthening economy "may prove no challenge at all for stocks," Paulsen said.

Parallel Experiments from pl


Telegram Parallel Experiments
FROM USA